A STUDY OF BANK CUSTOMER CHURN
Description of each variable:
- credit_score — client credit score
- geography — client country
- gender — client gender
- age — client age
- tenure — number of years the client has been with the bank
- balance — client account balance
- num_of_products — number of bank products the client purchased
- has_cr_card — whether the client has a credit card
- is_active_member — whether the client is an active member
- estimated_salary — client salary
- exited — whether the client left the bank
# load libraries
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt #plt.xkcd()
from IPython.display import set_matplotlib_formats
import matplotlib.ticker as mtick
from sklearn.preprocessing import LabelEncoder
from utilities import *
import scipy.stats as ss
from scipy.stats import ttest_ind
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from itertools import combinations
import warnings
warnings.filterwarnings("ignore")
# Improve quality of the plots
set_matplotlib_formats('svg')
# Set the style of the plot
plt.style.use('seaborn-v0_8-white')
#plt.xkcd()
1-Load and clean the data
# Load the data
df = pd.read_csv("churn.csv", index_col=False)
df.head(5)
| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
- At least 3 columns provide no training value: RowNumber, CustomerId, and Surname.
col_name = ['credit_score', 'geography', 'gender', 'age', 'tenure', 'balance',
'num_of_products', 'has_cr_card', 'is_active_member', 'estimated_salary', 'exited']
# Remove unnecessary variables
df = df.drop(["RowNumber", "CustomerId", "Surname"], axis=1)
# Change the columns name
df.columns = col_name
# Show the data
df.head(5)
| | credit_score | geography | gender | age | tenure | balance | num_of_products | has_cr_card | is_active_member | estimated_salary | exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   credit_score      10000 non-null  int64
 1   geography         10000 non-null  object
 2   gender            10000 non-null  object
 3   age               10000 non-null  int64
 4   tenure            10000 non-null  int64
 5   balance           10000 non-null  float64
 6   num_of_products   10000 non-null  int64
 7   has_cr_card       10000 non-null  int64
 8   is_active_member  10000 non-null  int64
 9   estimated_salary  10000 non-null  float64
 10  exited            10000 non-null  int64
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB
- The dataset contains information about 10,000 bank customers.
- No missing values were found.
- There are both categorical and numerical features.
# Number of unique value for each variable
df.nunique()
credit_score         460
geography              3
gender                 2
age                   70
tenure                11
balance             6382
num_of_products        4
has_cr_card            2
is_active_member       2
estimated_salary    9999
exited                 2
dtype: int64
2-Statistics and data visualisation
Bar plot of binary variables
plt.figure(figsize=(8, 8))
bar_columns = ["exited", "gender", "has_cr_card", "is_active_member"]
counter = 1
for key in bar_columns:
plt.subplot(2, 2, counter)
bar_plot(df[key], x_label=key, y_label="%", color=['indigo', 'gold'])
counter += 1
- Gender and activities among customers are evenly distributed.
- The has_cr_card classes are imbalanced: most customers hold a credit card.
- Only a minority of customers exited, so the target variable is imbalanced.
Box plot of continuous variables
numeric_col = ['age', 'estimated_salary', 'balance', 'credit_score']
plt.figure(figsize=(10, 10))
counter = 1
for key in numeric_col:
plt.subplot(2, 2, counter)
sns.boxplot(x = df['exited'], y = df[key], palette=['indigo', 'gold'])\
.set(title=f'{key} by exited')
counter += 1
Correlation between numerical variables
# Correlation between numerical variables
plt.figure(figsize=(9, 6))
sns.heatmap(df.drop(['geography', 'gender'], axis=1).corr(), annot=True, cmap='viridis', linewidths=.5)
plt.title('Correlation Plot')
Text(0.5, 1.0, 'Correlation Plot')
- Based on the chart provided, there is no strong linear relationship observed between the variables.
Association between categorical variables
For this purpose, we will use Cramer's V, which measures how strongly two categorical variables are associated.
$$ \mathcal{V} = \sqrt{\frac{\chi^2 / n}{\min(r-1,\, c-1)}} $$
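The CramerV function called below is imported from the local utilities module, whose source is not shown. A plausible sketch of it, assuming the second argument toggles the Bergsma–Wichers bias correction:

```python
import numpy as np
import scipy.stats as ss

def CramerV(contingency_table, bias_correction=False):
    """Sketch of the `utilities.CramerV` helper (assumed implementation):
    Cramer's V for a contingency table given as a 2-D numpy array."""
    chi2 = ss.chi2_contingency(contingency_table)[0]
    n = contingency_table.sum()
    r, c = contingency_table.shape
    phi2 = chi2 / n
    if bias_correction:
        # Bergsma-Wichers bias-corrected estimate of phi^2 and dimensions
        phi2 = max(0.0, phi2 - (r - 1) * (c - 1) / (n - 1))
        r = r - (r - 1) ** 2 / (n - 1)
        c = c - (c - 1) ** 2 / (n - 1)
    return np.sqrt(phi2 / min(r - 1, c - 1))
```

The value ranges from 0 (no association) to 1 (perfect association), regardless of table size.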
categorical_col = ["num_of_products", "gender", "has_cr_card", "is_active_member", "geography"]
score = {}
exited = df["exited"]
for col in categorical_col:
    # pd.crosstab builds the contingency table of exited vs. the category
    contingency_table = pd.crosstab(exited, df[col]).to_numpy()
    score[col] = CramerV(contingency_table, True)
score = pd.Series(score, index=categorical_col).sort_values(ascending=False)
score.plot(kind="bar", grid=True, color=['#4B0082', '#4B008266', '#4B008299', '#4B0082CC'])
plt.xticks(rotation=50, ha='right')
plt.title('Exited and Categorical')
Text(0.5, 1.0, 'Exited and Categorical')
A Cramer's V value of 0.4 indicates a moderate association between exited and num_of_products. This is the strongest association observed, but it is still not very strong. The remaining variables show a relatively weak association with exited.
Conditional probabilities
#--------------------------------------------
# Calculating the conditional probabilities
#--------------------------------------------
probs = {}
for col in categorical_col:
uniques = df[col].unique()
for unique in uniques:
condition = df[col] == unique
probs[f'P(Exited|{col}={unique})'] = round(df.loc[condition, 'exited'].mean(), 2)
probs = pd.Series(probs).sort_values(ascending=False)
#--------------------------------------------
# plot the probabilities
#--------------------------------------------
colors = ["#FF0000", "#FF4500", "#FFA500", "#FFD700", "#FFFF00","#ADFF2F", "#7FFF00", "#00FF00",
"#00FA9A", "#00CED1", "#4682B4", "#0000FF", "#4B0082"]
probs.plot(kind="bar", grid = True, color = colors)
plt.xticks(rotation=50, ha='right')
plt.title('Probability of Exited given Categorical')
Text(0.5, 1.0, 'Probability of Exited given Categorical')
Customers with 3 or 4 products face a higher risk of exiting the bank than those with 1 or 2 products.
For the other categories, the conditional probability of exiting is noticeably lower.
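The pattern used above — taking the mean of the 0/1 exited column under a condition — gives exactly the conditional probability P(Exited | condition). This can be checked on a toy example (the numbers below are illustrative, not taken from the real dataset):

```python
import pandas as pd

# Toy data: customers with 3 products churn more (illustrative numbers)
toy = pd.DataFrame({
    "num_of_products": [1, 1, 1, 1, 3, 3, 3, 3],
    "exited":          [0, 0, 1, 0, 1, 1, 1, 0],
})
# Mean of the 0/1 indicator under a condition is P(exited | condition)
p_exit_given_3 = toy.loc[toy["num_of_products"] == 3, "exited"].mean()
print(p_exit_given_3)  # 0.75
```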
Distribution of continuous variables
fig, axes = plt.subplots(2,2, figsize = (9,9))
plt.subplots_adjust(hspace=0.5)
columns = df[numeric_col]
for i, column in enumerate(columns):
ax = axes[i // 2, i % 2]
sns.kdeplot(data = df,
x = column,
fill = True,
alpha = 0.5,
hue = 'exited',
palette = ['#6A5ACD', '#4B0082'],
ax = ax)
ax.set_xlabel(column, fontsize = 14)
plt.show()
- The distributions above differ between clients who exited and those who stayed.
- The distributions have heavy tails, which implies higher probabilities for extreme values.
- Clients over 70 and under 20 have a higher chance of not exiting.
plt.figure(figsize=(10, 9))
paired_list_numerical = list(combinations(numeric_col, 2))
counter = 1
for key in paired_list_numerical:
plt.subplot(3, 2, counter)
sns.scatterplot(x=key[0], y=key[1], hue="exited", palette=['#6A5ACD', "gold"], data=df)
counter += 1